Information Extraction with and without Parsing Semi-structured Documents

Authors

  • Daisuke Ikeda
  • Yasuhiro Yamada
Abstract

Information extraction from semi-structured documents comprises three steps: contents detection, wrapper generation, and schema extraction. Contents detection corresponds to preparing training examples for machine-learning-based wrapper induction, and schema extraction identifies the types of the extracted data. We formulate contents detection using the repetitive pattern introduced in this paper: given strings generated by some repetitive pattern, the contents detection problem is to find the constant strings of that pattern. We then develop a linear-time algorithm for this problem. This implies that the algorithm does not need to parse the given documents, except for a saw, which is a set of characters marking the boundaries between the contents and the other parts of the documents. Experiments also support the necessity of the saw. Parsing, however, is shown to play an important role in the schema extraction step.
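
As a rough illustration of the contents detection problem stated in the abstract (finding the constant strings shared by strings generated from one pattern), the following minimal sketch compares two strings without parsing them. It is not the authors' linear-time algorithm: it substitutes Python's `difflib.SequenceMatcher`, which is slower in the worst case, and it does not model the saw. The function name `constant_parts` and the `min_len` threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def constant_parts(a: str, b: str, min_len: int = 3) -> list[str]:
    """Return substrings shared, in order, by two strings.

    Hypothetical helper: treats long common blocks as the "constant
    strings" of the pattern and everything between them as contents.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a : m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # drops short accidental matches and the sentinel
    ]


# Two strings assumed to come from the same template.
page_1 = "<b>Name:</b> Alice <i>Age:</i> 30"
page_2 = "<b>Name:</b> Bob <i>Age:</i> 41"
print(constant_parts(page_1, page_2))
```

The text between consecutive constant parts ("Alice", "30", and so on) is what a wrapper would then extract as contents.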

Similar resources

Space characters in Chinese semi-structured texts

Space characters can have an important role in disambiguating text. However, few, if any, Chinese information extraction systems make full use of space characters. However, it seems that treatment of space characters is necessary, especially in cases of extracting information from semi-structured documents. This investigation aims to address the importance of space characters in Chinese informa...

Xtractor: A light wrapper for XML paragraph-centric documents

The emergence of XML has led to the development of applications centered on XML documents. Often the documents contain tagged paragraphs of natural language text. The extraction of relevant data from paragraphs is confronted with their irregular structure hidden in the text and requires powerful extraction patterns. Although a large spectrum of wrappers has been conceived to mainly process HTML pages, the ...

Kernels for Semi-Structured Data

Semi-structured data such as XML and HTML is attracting considerable attention. It is important to develop various kinds of data mining techniques that can handle semistructured data. In this paper, we discuss applications of kernel methods for semistructured data. We model semi-structured data by labeled ordered trees, and present kernels for classifying labeled ordered trees based on their ta...

Semi-structured Information Extraction Applying Automatic Pattern Discovery

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction, for example WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to automatically generate extractors. However, this approach still requires human intervention to provide training examples. He...

Header Metadata Extraction from Semi-structured Documents Using Template Matching

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is considered as strings with format. Templates are used to guide finite state auto...

Publication date: 2004